Optimal Galaxy Distance Estimators 



M.A. Hendry^1 and J.F.L. Simmons^2
^1 Astronomy Centre, University of Sussex, Brighton BN1 9QH, UK.
^2 Department of Physics and Astronomy, University of Glasgow, Glasgow G12 8QQ, UK.



Abstract 

The statistical properties of galaxy distance estimators corresponding 
to the Tully-Fisher and $D_n - \sigma$ relations are studied, and a rigorous framework
for identifying and removing the effects of Malmquist bias due to
observational selection is developed. The prescription of Schechter (1980) 
for defining unbiased distance estimators is verified and extended to more 
general - and more realistic - cases. Finally, the derivation of 'optimal' 
unbiased estimators of minimum dispersion, by utilising information from 
additional suitably correlated observables, is discussed and the results 
applied to a calibrating sample from the Fornax cluster, as used in the 
Mathewson spiral galaxy redshift survey. The optimal distance estima- 
tor derived from apparent magnitude, diameter and 21cm line width has 
an intrinsic scatter which is 25 % smaller than that of the Tully-Fisher 
relation for this calibrating sample. 



1 INTRODUCTION 



In recent years the analysis of redshift surveys of galaxies has made a significant 
contribution to our emerging understanding of the formation and evolution of 
large scale structure in the universe. A crucial element in this analysis is the 
accurate estimation of galaxy distances, and an important feature of many re- 
cent surveys has been the availability of redshift independent distance indicators 
which allow one to determine directly an estimate of the radial peculiar velocity 
of each galaxy in the survey. By far the most prevalent examples of such distance 
indicators are the Tully-Fisher and $D_n - \sigma$ relations. These have been used by
a number of authors in attempts to reconstruct, from various redshift surveys,
the full 3-dimensional peculiar velocity and density contrast fields (c.f. [14], [17]).
This work has been at the forefront of a mounting body of evidence
in support of galaxy clustering and coherent streaming motions on scales of the
order of 100 Mpc; evidence which, nevertheless, has attracted considerable con- 
troversy in the literature - not least because of the difficulties which it presents 
for currently popular theories of structure formation. Much of this debate has 
focussed upon the statistical properties of the Tully-Fisher and $D_n - \sigma$ relations,
and the extent to which detections of galaxy streaming might be a statistical 
artefact of the distance indicators. 

The aim of this paper is to address and clarify several statistical issues re- 
lating to the use of redshift independent distance indicators, particularly with 
respect to the systematic biases which arise in surveys subject to observational 
selection. These systematic effects have been referred to generically in the litera- 
ture as 'Malmquist bias', although there exists a lack of consensus as to precisely 
what is meant by this term - and consequently some disagreement over how one 
should best deal with its effects in analyses of galaxy redshift surveys. In this 
paper we identify Malmquist bias and examine its effects on redshift independent 
distance indicators by following the statistical formalism which we adopted previously
in this context in [?] (hereafter HS). In particular we examine in what
circumstances Malmquist bias may be eliminated completely from redshift independent
distance indicators, thus defining what one might regard as an 'optimal'
galaxy distance estimator. We will also consider the statistical basis of other
approaches to Malmquist bias which have been adopted in the literature (c.f.
[13]), and clarify the important differences between these approaches and
the formalism which we adopt here. In a concurrent paper [18] we examine in
detail the consequences of using biased distance indicators for reconstructing the
large scale velocity and density fields - particularly with respect to the POTENT
method [?].

The Tully-Fisher and $D_n - \sigma$ relations are both derived empirically, by fitting
a power law to the relationship between two intrinsic physical characteristics of
galaxies: the luminosity and the width of the HI 21cm line of spirals in the
case of Tully-Fisher, and the intrinsic diameter and central velocity dispersion of
ellipticals in the case of $D_n - \sigma$. Both relations are generally expressed in terms
of log quantities, and are thus fitted to be linear in form - e.g. for Tully-Fisher
we have an expression of the form:-

$$M = a \log W + b \qquad (1)$$

where M is the absolute magnitude and W the 21cm line width. The constants
a and b, the slope and zero-point of the relation, are determined empirically -
usually with a calibrating sample of reference galaxies the distances of which
have been measured independently.^1 To apply the relation one simply measures
the line width of a given galaxy, and infers from equation (1) an estimate of its
absolute magnitude. This can then be combined with the galaxy's observed 
apparent magnitude to obtain an estimate of its distance. 
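The procedure just described can be sketched in a few lines. This is only an illustration: the function name and the calibration constants `A_SLOPE` and `B_ZERO` are hypothetical values chosen for the example, not a calibration taken from this paper, and the distance modulus is written for distances in Mpc with absorption and cosmological effects ignored.

```python
import math

def tully_fisher_distance_mpc(apparent_mag, line_width, a, b):
    """Distance in Mpc from equation (1) combined with the distance modulus."""
    abs_mag = a * math.log10(line_width) + b      # equation (1): M = a log W + b
    # m - M = 5 log10(d / Mpc) + 25, so log10(d) = 0.2 (m - M - 25)
    return 10.0 ** (0.2 * (apparent_mag - abs_mag - 25.0))

# Hypothetical slope and zero point, for illustration only.
A_SLOPE, B_ZERO = -7.0, -3.0

d = tully_fisher_distance_mpc(apparent_mag=12.0, line_width=400.0,
                              a=A_SLOPE, b=B_ZERO)
print(f"estimated distance: {d:.1f} Mpc")
```

At fixed line width a fainter apparent magnitude yields a larger distance, as one would expect.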

Finding the 'best' values of the constants a and b has been a thorny issue 
in the literature for a number of years. The straight line relationship given by 
equation (1) is generally fitted by performing a linear regression on the calibrating
sample. The question of which linear regression is most appropriate is non-trivial,
however, particularly when one's survey is subject to observational
selection effects - a fact which has been widely recognised (c.f. [?]).
We can illustrate this with the following simple example. Figure (1)
represents schematically the typical scatter of the Tully-Fisher relation, assuming 
that absolute magnitude and log line width are random variables whose joint 
distribution is bivariate normal. (More precisely, the ellipse shown in Figure (1) 
is an isoprobability contour enclosing a given confidence region for magnitude 
and log line width). The solid and dotted lines indicate the linear relationships 
obtained by regressing line widths on magnitudes and magnitudes on line widths 
respectively. Thus the dotted line is the mean, or expected, value of absolute
magnitude conditional upon log line width. Conversely the solid line is the 
expected log line width conditional upon absolute magnitude. Since in practice 
one wishes to infer the value of M from the measured line width, the regression 
of magnitudes on line widths is generally referred to as the 'direct' Tully-Fisher 
relation, while regressing line widths on magnitudes is often termed the 'inverse'
Tully-Fisher relation. Introducing P as a shorthand for log line width (c.f. [?]),
the following equations define the direct and inverse regression lines for
the bivariate normal case:-

$$E(\mathbf{M}|P) = M_0 + \rho \frac{\sigma_M}{\sigma_P} (P - P_0) \qquad (2)$$

$$E(\mathbf{P}|M) = P_0 + \rho \frac{\sigma_P}{\sigma_M} (M - M_0) \qquad (3)$$

where $M_0$, $P_0$, $\sigma_M$, $\sigma_P$ and $\rho$ denote the means, dispersions and correlation
coefficient of the bivariate normal distribution of magnitudes and log line widths.
Note that we have also adopted the standard statistical convention of denoting
random variables by bold face characters.

It follows from equations (2) and (3) that both the direct and inverse regression
lines can be used to infer an estimate of the absolute magnitude of a given galaxy
which is a linear function of its measured log line width, although the constants a
and b in equation (1) will clearly be different in each case. Moreover the definition
of the estimate of M inferred from each regression line is also subtly different.
For the direct regression line the estimate of M is the mean absolute magnitude
at the observed log line width - i.e. $E(\mathbf{M}|P = P_{obs})$. For the inverse regression
line, on the other hand, the estimated absolute magnitude is the value of M such
that the mean log line width conditional upon M is equal to its observed value
- i.e. $E(\mathbf{P}|M) = P_{obs}$. Consequently - as indeed is apparent from their slopes -
the two lines give rise to markedly different distance estimators.

^1 In some analyses of redshift surveys, c.f. [?], the slope and zero point are fitted simultaneously
with the parameters of a specific velocity field model, using all of the survey galaxies.
We will consider the specific statistical issues raised by this contrasting approach elsewhere.
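The difference between the two regression lines can be made concrete with a small simulation. The sketch below assumes a bivariate normal distribution with purely illustrative parameter values (`M0`, `P0`, `SIG_M`, `SIG_P`, `RHO` are not taken from any real calibration); it fits both lines to a complete sample and compares the absolute magnitude each one infers at the same observed log line width.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bivariate normal parameters (assumed, not from the paper).
M0, P0, SIG_M, SIG_P, RHO = -20.0, 2.5, 1.0, 0.2, -0.8
cov = [[SIG_M**2, RHO * SIG_M * SIG_P],
       [RHO * SIG_M * SIG_P, SIG_P**2]]
M, P = rng.multivariate_normal([M0, P0], cov, size=200_000).T

# Direct relation: regress M on P, giving M_hat = a_dir * P + b_dir.
a_dir, b_dir = np.polyfit(P, M, 1)

# Inverse relation: regress P on M (P = alpha * M + beta), then solve for M.
alpha, beta = np.polyfit(M, P, 1)
a_inv, b_inv = 1.0 / alpha, -beta / alpha

P_obs = 2.7                                  # an observed log line width
M_direct = a_dir * P_obs + b_dir             # E(M | P = P_obs)
M_inverse = a_inv * P_obs + b_inv            # M such that E(P | M) = P_obs

print(f"direct slope  {a_dir:.2f}, inferred M = {M_direct:.2f}")
print(f"inverse slope {a_inv:.2f}, inferred M = {M_inverse:.2f}")
```

For these parameters the fitted slopes should lie close to $\rho \sigma_M / \sigma_P$ and $\sigma_M / (\rho \sigma_P)$ respectively, so the inverse line is the steeper of the two and infers a brighter magnitude for a galaxy with a broader than average line.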

The situation becomes more complex when we include the effects of observa- 
tional selection. Figure (2) shows schematically the distribution of M and P for 
observable galaxies in a sample subject to a sharp cut-off in absolute magnitude
- as would be the case for e.g. a more distant cluster observed in an apparent
magnitude limited survey. We can see that in this case the expected value of M
conditional on P is dramatically different from the direct regression line for the
complete sample: in fact $E(\mathbf{M}|P)$ is no longer linear in P but curves sharply as
M approaches the magnitude limit. 

This means that if one calibrates the Tully-Fisher relation in the nearby 
cluster using the direct regression line, and then applies this relation to estimate 
the distance of a more distant cluster (or indeed a distant field galaxy), one 
will systematically underestimate its distance because the expected value of M 
given P in the more distant sample is systematically brighter than the value 
predicted by the direct regression line. It is essentially this systematic error or 
bias in the inferred distance which we identify as 'Malmquist bias', although we 
will define more rigorously what we mean by the bias of a distance estimator in 
section 2.1 below. The bias is precisely analogous to the effect identified by [15]
in considering the mean absolute magnitude of observable 'standard candles'
brighter than some given apparent magnitude limit. The effect of Malmquist
bias upon the Tully-Fisher relation has been illustrated in a similar manner to
Figures (1) and (2) by a number of different authors (c.f. [?]).



For the case where one's sample is subject only to luminosity selection, Schechter [?]^2
recognised that the slope of the inverse regression line is unchanged, irrespective
of the magnitude-completeness of one's sample. In other words this regression 
line is free from the Malmquist effect, and it may therefore be used to provide 
an unbiased galaxy distance estimate. Although the unbiased property of the 
'Schechter' inverse regression line has been generally recognised, its ramifications 
for estimating galaxy distances have not - it would seem to us - been fully 
appreciated, and the application of Schechter's ideas to more realistic situations 
has not been fully explored. Such an extension forms the central aim of this 
paper. We set out to place the Schechter result on a rigorous statistical footing, 
following the same formalism previously developed in HS, in order to confirm its 
range of validity, examine the assumptions upon which it depends and consider 
to what extent those assumptions may be generalised. 
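The invariance of the inverse slope under luminosity selection is easy to check numerically. The following is a minimal sketch under assumed conditions: a bivariate normal population with illustrative parameters and a hypothetical sharp cut `M_CUT` in absolute magnitude. It is a toy demonstration, not a reproduction of any analysis in the literature.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative bivariate normal parameters (assumed for this sketch).
M0, P0, SIG_M, SIG_P, RHO = -20.0, 2.5, 1.0, 0.2, -0.8
cov = [[SIG_M**2, RHO * SIG_M * SIG_P],
       [RHO * SIG_M * SIG_P, SIG_P**2]]
M, P = rng.multivariate_normal([M0, P0], cov, size=200_000).T

M_CUT = -20.0              # hypothetical sharp cut: only brighter galaxies are seen
seen = M < M_CUT

# Slope of P regressed on M (inverse relation), complete and truncated samples.
slope_inv_full = np.polyfit(M, P, 1)[0]
slope_inv_cut = np.polyfit(M[seen], P[seen], 1)[0]

# Slope of M regressed on P (direct relation), complete and truncated samples.
slope_dir_full = np.polyfit(P, M, 1)[0]
slope_dir_cut = np.polyfit(P[seen], M[seen], 1)[0]

print(f"inverse slope: complete {slope_inv_full:.3f}, truncated {slope_inv_cut:.3f}")
print(f"direct slope:  complete {slope_dir_full:.3f}, truncated {slope_dir_cut:.3f}")
```

The inverse slope should agree between the two samples to within sampling noise, whereas the direct slope is substantially shallower in the truncated sample.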



^2 Schechter's original treatment was for the Faber-Jackson relation between luminosity and
velocity dispersion for ellipticals, although he noted that precisely the same principle held for
Tully-Fisher.

To this end, in section 2 we study the properties of a general distance estimator^3
formed from an arbitrary linear combination of apparent magnitude and log
line width. It is easy to see that whichever linear regression one adopts the corresponding
distance estimator will take this simple linear form, since one will always
infer the absolute magnitude of a given galaxy as a linear function of log(line
width). Our analysis is carried out in the first instance for the Tully-Fisher case,
with the corresponding results for $D_n - \sigma$ indicated where appropriate.

2 PROPERTIES OF GENERAL LINEAR ESTIMATOR 

2.1 WHAT DO WE MEAN BY MALMQUIST BIAS? 

The approach which we adopt here is a natural extension of the formalism de- 
veloped in HS to study the properties of distance estimators which are functions 
of only one observable - apparent magnitude. Before we proceed in earnest we 
recall from HS a rigorous definition of what we mean by the bias of an esti- 
mator, and clarify the differences between this approach and other treatments 
of Malmquist bias in the literature. The contrasting approaches can be classed 
as belonging to one of two categories: 'frequentist' and 'Bayesian'. These terms 
reflect the fact that at the heart of the difference between the two approaches 
lies the long-standing dichotomy between a Bayesian and frequentist view of the 
nature of probability. 

The frequentist picture is essentially based on the intuitively familiar concept 
that the probability of an event measures the relative frequency of that event 
occurring in a large number of repeated experiments or trials. In the limit as 
the number of trials tends to infinity a histogram of relative frequencies tends 
to the probability density function (pdf) of a random variable - in this case our 
galaxy distance estimator. Crucial to the frequentist approach is the idea that 
the true distance of the galaxy in every trial is a fixed, though of course unknown, 
parameter - an 'unknown state of nature' in the usual statistical terminology. 

We can state these ideas more rigorously as follows, taking as an illustration 
the case of an estimator of log distance since we have seen in section 1 that
such an estimator arises naturally from the Tully-Fisher and $D_n - \sigma$ relations.
Estimators of log distance will be the focus of our analysis for most of this paper,
although similar remarks will clearly also apply to an estimator of distance or
any other parameter.

Suppose that $w_0$ is the true log distance of a given galaxy. Let $\hat{w}$ denote an
estimator of $w_0$. (Following the standard convention we denote an estimator of
a parameter by a caret). Let $p(\hat{w}|w_0)$ denote the pdf of $\hat{w}$, given the true value
of $w_0$. One defines $\hat{w}$ to be unbiased if the expected value of $\hat{w}$ is equal to $w_0$.
In general the bias, B, of $\hat{w}$ at true log distance $w_0$ is given by:-

$$B(\hat{w}, w_0) = \int \hat{w} \, p(\hat{w}|w_0) \, d\hat{w} - w_0 \qquad (4)$$

^3 More correctly, an estimator of log distance - a point to which we return presently.

Another important quantity which one can introduce is the mean square error
or risk, R, of an estimator, defined by:-

$$R(\hat{w}, w_0) = \int (\hat{w} - w_0)^2 \, p(\hat{w}|w_0) \, d\hat{w} \qquad (5)$$

(c.f. eqs. (16) and (17) of HS). Note that for an unbiased estimator, the risk is
identically equal to the variance. Note also that both the bias and risk are in
general functions of the true log distance, $w_0$. This fact indicates the essential
difficulty of completely removing Malmquist bias from galaxy distance estima- 
tors: the magnitude of the bias for any given galaxy in general depends upon its 
true distance, which is unknown. 
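In the frequentist picture the integrals defining the bias and risk can be approximated by averaging over many simulated realisations of the estimator at fixed true log distance. The sketch below uses a toy estimator (normally distributed about a deliberately offset mean) purely as an assumption for illustration; the helper name `bias_and_risk` is hypothetical.

```python
import numpy as np

def bias_and_risk(w_hat_samples, w0):
    """Monte Carlo approximations to the bias and the risk at fixed true w0."""
    err = np.asarray(w_hat_samples) - w0
    return err.mean(), np.mean(err**2)

rng = np.random.default_rng(2)
w0 = 1.5                                    # true log distance, fixed in every trial

# Toy estimator: normal with standard deviation 0.1 and an offset of 0.05.
samples = rng.normal(w0 + 0.05, 0.1, size=100_000)

bias, risk = bias_and_risk(samples, w0)
variance = samples.var()

print(f"bias = {bias:.4f}, risk = {risk:.5f}, variance + bias^2 = {variance + bias**2:.5f}")
```

The decomposition risk = variance + bias^2 holds exactly for the sample moments, which makes explicit why the risk of an unbiased estimator equals its variance.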

The definition of the bias of an estimator given by equation (4) differs from
that adopted in those treatments of Malmquist bias which we may categorise
as Bayesian - most notably the derivation of 'Malmquist corrections' in [14]
(hereafter LB) and [13] (hereafter LS). In the Bayesian picture one regards the
true log distance of the sampled galaxies itself as a random variable, to which one
can ascribe some prior probability distribution, $p(\mathbf{w}_0)$, based upon an assumed
spatial density distribution and selection function (note that $\mathbf{w}_0$ is now written
in bold face). Following the measurement of the log distance estimator, $\hat{w}$, for
each galaxy one can define a posterior distribution, $p(\mathbf{w}_0|\hat{w})$, for $\mathbf{w}_0$ conditional
upon $\hat{w}$ which will differ from the prior. It is the properties of this posterior
distribution which LB and LS consider in defining an estimator as unbiased. By
applying Bayes' theorem one can derive an expression for $p(\mathbf{w}_0|\hat{w})$, viz:-

$$p(\mathbf{w}_0|\hat{w}) = \frac{p(\hat{w}|\mathbf{w}_0) \, p(\mathbf{w}_0)}{\int p(\hat{w}|\mathbf{w}_0) \, p(\mathbf{w}_0) \, d\mathbf{w}_0} \qquad (6)$$

where the likelihood function, $p(\hat{w}|\mathbf{w}_0)$, is simply the pdf of $\hat{w}$ conditional on
true log distance $w_0$, as in equation (4) above.

LB and LS define $\hat{w}$ as unbiased if the expected value of $\mathbf{w}_0$ with respect to
the posterior distribution, $p(\mathbf{w}_0|\hat{w})$, is equal to $\hat{w}$. In general the bias of $\hat{w}$ is
defined by:-

$$B(\hat{w}, \mathbf{w}_0) = \int \mathbf{w}_0 \, p(\mathbf{w}_0|\hat{w}) \, d\mathbf{w}_0 - \hat{w} \qquad (7)$$

By assuming a posterior distribution and likelihood function LB and LS derive
a Malmquist correction to remove the bias of their 'raw' log distance estimator
(which they denote by $l_e$), so that the corrected estimator is unbiased.
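To make the Bayesian definition concrete, the posterior mean of equation (6) can be evaluated numerically on a grid. The sketch below is not the procedure of LB or LS: it assumes a gaussian likelihood of width `sigma` in log10 distance and, as a labelled assumption, a prior corresponding to uniform space density with no selection, $p(\mathbf{w}_0) \propto 10^{3 \mathbf{w}_0}$. For that special case the shift of the posterior mean relative to $\hat{w}$ is known analytically to be $3 \ln(10) \sigma^2$, which gives a check on the numerics.

```python
import numpy as np

def posterior_mean(w_hat, sigma, log_prior, grid):
    """Posterior mean of w0 given w_hat, by direct summation on a uniform grid."""
    log_like = -0.5 * ((w_hat - grid) / sigma) ** 2     # gaussian likelihood p(w_hat | w0)
    log_post = log_like + log_prior(grid)               # Bayes' theorem, unnormalised
    weights = np.exp(log_post - log_post.max())         # subtract max for numerical safety
    return np.sum(grid * weights) / np.sum(weights)

def homogeneous_log_prior(w):
    """Assumed prior: uniform density, no selection, so p(w0) is proportional to 10**(3 w0)."""
    return 3.0 * np.log(10.0) * w

grid = np.linspace(-2.0, 6.0, 80_001)
w_hat, sigma = 1.5, 0.08                                # illustrative values

correction = posterior_mean(w_hat, sigma, homogeneous_log_prior, grid) - w_hat
print(f"numerical shift  = {correction:.5f}")
print(f"analytical shift = {3.0 * np.log(10.0) * sigma**2:.5f}")
```

Passing a different `log_prior` (for example one including a selection function) changes the shift, which is why such corrections are model dependent.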

The question of which approach one should take to the definition (to say 
nothing of the elimination!) of Malmquist bias is far from clear-cut, and depends 
strongly upon the context in which galaxy distance estimators are being used. 
For example [?] develop their POTENT distance error analysis from the Bayesian
viewpoint and apply Malmquist corrections to their raw distance estimates. They
argue that this approach is essential to their analysis due to the nature of the
smoothing procedures carried out in POTENT.

We study the effects of biased distance estimators on POTENT in [18], and
give a full discussion of the broader statistical issues relating to the merits of
the frequentist and Bayesian descriptions elsewhere (c.f. [?]). Although we
concentrate on the frequentist description of Malmquist bias for the remainder
of this paper, our results nevertheless have crucial implications for the Bayesian
approach. This is because the Malmquist corrections defined in LB and LS are
derived on the assumption of a raw log distance estimator, $l_e$, for each galaxy
which is normally distributed with mean value equal to the true log distance. If
this condition is not met then the Malmquist corrections derived from $l_e$ will not
eliminate Malmquist bias [19].

In short, then, LB and LS make the crucial assumption that $l_e$ is unbiased in
the frequentist sense, in order to define a corrected estimator which is unbiased
in the Bayesian sense. It is this fact which makes our discussion of (frequentist)
unbiased estimators in this paper extremely important for both frequentist and
Bayesian approaches to Malmquist bias.

Rather than assuming a pdf for our log distance estimator as in LB and LS,
in section 2.2 we now derive the pdf in terms of the intrinsic joint distribution
of absolute magnitude and line widths, and the observational selection effects.



2.2 THE OBSERVED DISTRIBUTION OF M AND P 

Let the absolute magnitude, M, and spatial position, r, of a galaxy be random 
variables. Suppose we now introduce a third random variable, P, which denotes 
some intrinsic physical characteristic of the galaxy such that the measured value 
of P in general provides information on the value of M - i.e. M and P are 
correlated. It is convenient to identify P explicitly as log line width, as we have 
been doing up until now, although one should bear in mind that the formalism 
holds more generally for any suitably correlated physical variable. 

Suppose next that neither M nor P is correlated with r, so that we may
meaningfully introduce $\Psi(M,P)$, the intrinsic joint distribution of M and P,
which is independent of spatial position. Let $N(M,P,\mathbf{r}) \, dM \, dP \, dV$ denote the
actual number of galaxies in volume element dV at spatial position r with absolute
magnitude in the range M to M + dM and log line width in the range P to
P + dP. It then follows that:-

$$N(M,P,\mathbf{r}) \, dM \, dP \, dV = \Psi(M,P) \, n(\mathbf{r}) \, dM \, dP \, dV \qquad (8)$$

where $n(\mathbf{r})$ is the number density of galaxies at r.

Consider now the joint distribution, p(M, P, r), of M, P and r, for observable 
galaxies in a sample subject to observational selection effects. We characterise 
the selection effects by a selection function, S(M, P, r), defined as the probability 
that a galaxy of absolute magnitude, M, and log line width, P, at spatial position, 
r, would be observable. 

An expression for $p(M,P,\mathbf{r})$ in terms of $\Psi(M,P)$, $S(M,P,\mathbf{r})$ and $n(\mathbf{r})$ now
follows easily:-

$$p(M,P,\mathbf{r}) = \frac{\Psi(M,P) \, n(\mathbf{r}) \, S(M,P,\mathbf{r})}{\iiint \Psi(M,P) \, n(\mathbf{r}) \, S(M,P,\mathbf{r}) \, dM \, dP \, dV} \qquad (9)$$

Note that the selection function, $S(M,P,\mathbf{r})$, does not measure the probability
that a galaxy would actually be observed: clearly this would depend on the
true local number density of galaxies, $n(\mathbf{r})$, which will in general be unknown.
$S(M,P,\mathbf{r})$ as defined here will be independent of $n(\mathbf{r})$ and, moreover, will also
be independent of direction provided that one has corrected for the directional
dependence of galactic extinction. A number of standard observational methods
exist for carrying out these corrections (c.f. [?]).

Because $S(M,P,\mathbf{r})$ is defined independently of direction it is meaningful to
consider the distribution, $\phi(M,P|r_0)$, of absolute magnitude and log line width
for observable galaxies conditional on true distance, $r_0$, or equivalently on true
log distance, $w_0$. It follows from equation (9) that $\phi(M,P|w_0)$ is given by:-

$$\phi(M,P|w_0) = \frac{\Psi(M,P) \, S(M,P,w_0)}{\iint \Psi(M,P) \, S(M,P,w_0) \, dM \, dP} \qquad (10)$$

Note that this distribution is independent of the local number density, $n(\mathbf{r})$, of
galaxies. Although this useful property of conditional distributions was pointed
out by [?], it would seem that its relevance to Tully-Fisher type relations has not
been widely appreciated. The joint distribution of M and P at given distance
is generally derived on the assumption of a uniform spatial number density (c.f.
[?]). We see from equation (10) that such an assumption is in fact unnecessary,
and in particular $\phi(M,P|w_0)$ is identical for both field and cluster galaxies
provided of course that one can assume the intrinsic joint distribution $\Psi(M,P)$
to be independent of environment.



2.3 BIAS OF GENERAL LINEAR ESTIMATOR 



Ignoring absorption and cosmological effects, the following equation relates the
apparent and absolute magnitudes of a galaxy at given true log distance, $w_0$:-

$$m = M + 5 w_0 + \kappa \qquad (11)$$

Here $\kappa$ is a constant which depends upon our units of distance, e.g. if distances
are measured in Mpc then $\kappa = 25$. If distances are measured in km s$^{-1}$ by tying
the calibration of one's distance estimator to a cluster at some assumed redshift
distance (as is commonly the case in the literature) then $\kappa = 15 - 5 \log h$.

A sensible form for a general linear estimator, $\hat{w}_{GL}$, of $w_0$ is now clearly given
by:-

$$\hat{w}_{GL} = 0.2 (m - \hat{M} - \kappa) = 0.2 (m - aP - b - \kappa) \qquad (12)$$

where $\hat{M} = aP + b$, and a and b are constants (c.f. eq. (1) above).

By combining equations (10), (11) and (12) we can determine the joint distribution
function of m and P for observable galaxies - and from that the pdf of $\hat{w}_{GL}$
conditional on $w_0$. There is a somewhat more direct route to the same result,
however. Substituting equation (11) back into equation (12) and rearranging we
obtain:-

$$\hat{w}_{GL} - w_0 = 0.2 (M - aP - b) \qquad (13)$$

Equation (13) is of little practical use in defining $\hat{w}_{GL}$ since both M and $w_0$ are
unknown. However an expression for the bias of $\hat{w}_{GL}$ now follows directly, viz:-

$$B(\hat{w}_{GL}, w_0) = 0.2 \left( E(\mathbf{M}|w_0) - a E(\mathbf{P}|w_0) - b \right) \qquad (14)$$

where the expected values of M and P are with respect to the joint distribution
function for observable galaxies given by equation (10).

Equation (14) is valid for a completely general selection function of M, P and
$w_0$. Consequently, the expected values of M and P are both, in general, functions
of $w_0$ and it is this fact which makes the complete elimination of Malmquist bias
from a linear estimator impossible in the general case: one cannot choose values
of the constants a and b which define the distance estimator so that equation (14)
is identically zero for all true distances. To make any further progress towards
identifying an unbiased distance estimator requires making some assumptions
about the nature of the distribution function, $\phi(M,P|w_0)$.
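The dependence of the bias on true distance can be illustrated by evaluating equation (14) numerically. The sketch below assumes, for illustration only, a bivariate normal intrinsic distribution and a sharp apparent magnitude limit `M_LIM` as the selection function, with distances in Mpc; all parameter values are hypothetical. The expectations over observable galaxies are approximated by sample means.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative intrinsic distribution and survey limit (assumed values).
M0, P0, SIG_M, SIG_P, RHO = -20.0, 2.5, 1.0, 0.2, -0.8
KAPPA, M_LIM = 25.0, 14.0                   # distances in Mpc; sharp limit in m
cov = [[SIG_M**2, RHO * SIG_M * SIG_P],
       [RHO * SIG_M * SIG_P, SIG_P**2]]
M, P = rng.multivariate_normal([M0, P0], cov, size=400_000).T

def bias_general_linear(a, b, w0):
    """Equation (14), with expectations taken over galaxies observable at w0."""
    seen = M + 5.0 * w0 + KAPPA < M_LIM     # equation (11) plus the magnitude limit
    return 0.2 * (M[seen].mean() - a * P[seen].mean() - b)

# Constants a and b of the direct regression line for the complete population.
a_dir = RHO * SIG_M / SIG_P
b_dir = M0 - a_dir * P0

biases = {w0: bias_general_linear(a_dir, b_dir, w0) for w0 in (1.0, 1.8, 2.0)}
for w0, value in biases.items():
    print(f"w0 = {w0:.1f}: bias = {value:+.4f}")
```

With these choices the bias is negligible nearby, where the sample is essentially complete, and becomes increasingly negative (distances underestimated) as the magnitude limit cuts further into the luminosity function.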

2.4 SCHECHTER'S SOLUTION FOR AN UNBIASED ESTIMATOR

We can rewrite the intrinsic joint distribution function, $\Psi(M,P)$, of M and P
as follows:-

$$\Psi(M,P) = \Psi(M) \, \Psi(P|M) \qquad (15)$$

$\Psi(M)$ is just the galaxy luminosity function, well described by e.g. a Schechter
function or a gaussian, but regarded as an arbitrary function for the moment.
Note that this factorisation does not require any assumption about which
variable carries the scatter in the Tully-Fisher relation, but is valid in the completely
general (and more realistic!) case of scatter in both variables. Taking as
our lead the approach of [?], suppose we now make the following two crucial
assumptions:-

1. the selection function is independent of P

2. the conditional expectation of P given M is linear in M, i.e.:-

$$E(\mathbf{P}|M) = \alpha M + \beta \qquad (16)$$

where $\alpha$ and $\beta$ are constants, equal to the slope and zero point of the regression
line of P upon M. With these two assumptions $E(\mathbf{P}|w_0) = \alpha E(\mathbf{M}|w_0) + \beta$,
so that equation (14) reduces to:-

$$B(\hat{w}_{GL}, w_0) = 0.2 \left( (1 - a\alpha) E(\mathbf{M}|w_0) - b - a\beta \right) \qquad (17)$$

from which one sees that if $a = \alpha^{-1}$ and $b = -\beta \alpha^{-1}$ in equation (12), then the
bias of $\hat{w}_{GL}$ is zero for all values of $w_0$. In other words this solution identifies an
unbiased log distance estimator, $\hat{w}_I$, viz:-

$$\hat{w}_I = 0.2 \left( m - \alpha^{-1} (P - \beta) - \kappa \right) \qquad (18)$$

We use the subscript 'I' since this unbiased solution corresponds exactly to the
estimator one obtains from applying the inverse Tully-Fisher relation - i.e. regressing
line widths on magnitudes - in complete concordance with Schechter's result.
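The result can be checked with a short Monte Carlo experiment. The sketch below is a toy verification under assumed conditions, not a reproduction of any published analysis: M is gaussian, P scatters normally about a linear relation in M, and selection is a sharp apparent magnitude limit that does not depend on P. All numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative parameters (assumed values).
M0, P0, SIG_M, SIG_P, RHO = -20.0, 2.5, 1.0, 0.2, -0.8
KAPPA, M_LIM = 25.0, 14.0                   # distances in Mpc; sharp limit in m
ALPHA = RHO * SIG_P / SIG_M                 # slope of E(P|M)
BETA = P0 - ALPHA * M0                      # zero point of E(P|M)

def mean_error(a, b, w0, n=400_000):
    """Monte Carlo mean of w_hat - w0 for the general linear estimator."""
    M = rng.normal(M0, SIG_M, n)
    P = ALPHA * M + BETA + rng.normal(0.0, SIG_P * np.sqrt(1.0 - RHO**2), n)
    m = M + 5.0 * w0 + KAPPA                # apparent magnitude at true log distance w0
    seen = m < M_LIM                        # selection depends on magnitude only
    w_hat = 0.2 * (m[seen] - a * P[seen] - b - KAPPA)
    return w_hat.mean() - w0

inverse = (1.0 / ALPHA, -BETA / ALPHA)                          # a, b of the inverse line
direct = (RHO * SIG_M / SIG_P, M0 - RHO * SIG_M / SIG_P * P0)   # a, b of the direct line

err_inverse = mean_error(*inverse, w0=2.0)
err_direct = mean_error(*direct, w0=2.0)
print(f"mean error at w0 = 2.0: inverse {err_inverse:+.4f}, direct {err_direct:+.4f}")
```

At a true log distance where most of the luminosity function lies beyond the magnitude limit, the inverse estimator should remain unbiased to within sampling noise while the direct estimator is biased low.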

To fix these ideas with a specific example, consider again the case where M
and P are jointly normally distributed. This case certainly satisfies the assumption
that the conditional expectation of P given M be linear in M. Comparing
equations (3) and (16) we see that $\alpha = \rho \sigma_P / \sigma_M$ and $\beta = P_0 - \rho (\sigma_P / \sigma_M) M_0$, which implies the
following expression for the unbiased 'inverse' estimator:-

$$\hat{w}_I = 0.2 \left( m - M_0 - \frac{\sigma_M}{\rho \sigma_P} (P - P_0) - \kappa \right) \qquad (19)$$

It is instructive to compare $\hat{w}_I$ with the 'direct' log distance estimator, $\hat{w}_D$,
corresponding to the direct regression of magnitudes on line widths. The values
of the constants a and b for this case follow from equation (2), and give:-

$$\hat{w}_D = 0.2 \left( m - M_0 - \frac{\rho \sigma_M}{\sigma_P} (P - P_0) - \kappa \right) \qquad (20)$$

which differs from equation (19) only in the switching of the correlation coefficient,
$\rho$, from denominator to numerator, reflecting the different slope of the direct
regression line (c.f. Figures (1) and (2)). The bias of $\hat{w}_D$ now follows from
equation (17), after a little reduction:-

$$B(\hat{w}_D, w_0) = 0.2 (1 - \rho^2) \{ E(\mathbf{M}|w_0) - M_0 \} \qquad (21)$$

Several points emerge from this equation. Firstly notice that when $\rho = 0$, i.e.
when P and M are uncorrelated, then the bias of $\hat{w}_D$ reduces to the bias of the
'naive' estimator, $\hat{w}_N$, of log distance defined in HS and [12] by:-

$$\hat{w}_N = 0.2 (m - M_0 - \kappa) \qquad (22)$$

i.e. assuming that all galaxies are standard candles of absolute magnitude, $M_0$,
and ignoring the effects of Malmquist bias. This is not surprising, since when
$\rho = 0$ the measured log line width provides no additional information about
the value of M. The second point to note is that as $|\rho|$ tends to unity, on the
other hand, the bias of $\hat{w}_D$ tends to zero at all true distances. Again this follows
automatically from the fact that as $|\rho| \to 1$ the direct and inverse regression
lines become collinear, and $\hat{w}_D$ and $\hat{w}_I$ are identical. Lastly note that if there are
no magnitude selection effects then $\hat{w}_D$ is again unbiased at all true distances,
simply because we then have $E(\mathbf{M}|w_0) = M_0$ for all $w_0$. It is easy to see that
this result is true for an arbitrary joint intrinsic distribution function, $\Psi(M,P)$,
in the absence of selection effects.
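For one simple case equation (21) can be evaluated in closed form. The sketch below assumes a gaussian luminosity function and a sharp apparent magnitude limit, so that the observable galaxies at true log distance $w_0$ follow a truncated normal distribution in M; both assumptions and all numerical values are illustrative, and the function names are hypothetical.

```python
import math

def mean_abs_mag_observable(m_cut_abs, m0, sig_m):
    """E(M | M < m_cut_abs) for a gaussian luminosity function (truncated normal mean)."""
    z = (m_cut_abs - m0) / sig_m
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return m0 - sig_m * pdf / cdf

def direct_bias(w0, rho, m0, sig_m, m_lim, kappa=25.0):
    """Equation (21) for a sharp apparent magnitude limit m_lim."""
    m_cut_abs = m_lim - 5.0 * w0 - kappa    # faintest observable M at this distance
    return 0.2 * (1.0 - rho**2) * (mean_abs_mag_observable(m_cut_abs, m0, sig_m) - m0)

for rho in (0.0, -0.8, -1.0):
    value = direct_bias(w0=1.8, rho=rho, m0=-20.0, sig_m=1.0, m_lim=14.0)
    print(f"rho = {rho:+.1f}: bias of direct estimator = {value:+.4f}")
```

The three values of the correlation coefficient reproduce the limiting cases discussed above: the naive estimator's bias at zero correlation, a reduced bias for intermediate correlation, and no bias when the correlation is perfect.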

Finally we consider the risk and the higher moments of the $\hat{w}_{GL}$ distribution.
We can do this most easily by introducing a new random variable $\eta = P - (\alpha M + \beta)$.
This allows us to rewrite equation (13) as follows:-

$$\hat{w}_{GL} - w_0 = 0.2 \left[ (1 - a\alpha) M - b - a\beta - a\eta \right] \qquad (23)$$

For the unbiased inverse estimator we see that all but the final term of the right
hand side vanishes. It follows immediately from this that the moments of $\hat{w}_I - w_0$
are equal simply to a constant multiple of the moments of $\eta$, independent of the
true log distance! Moreover, since we are assuming that $E(\mathbf{P}|M) = \alpha M + \beta$ it
follows that the probability distribution of $\hat{w}_I$ is identical in shape to the intrinsic
conditional distribution, $\Psi(P|M)$. This latter distribution is generally modelled
to be gaussian (c.f. [?]), thus implying that the inverse estimator is normally
distributed, unbiased, and of constant variance at all true log distances.
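The constancy of the scatter can also be checked by simulation. The sketch below assumes gaussian scatter of P about a linear relation in M and a sharp apparent magnitude limit, with illustrative parameter values; under those assumptions equation (23) implies a standard deviation of $0.2 \sigma_\eta / |\alpha|$ for the inverse estimator at every true log distance.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative parameters (assumed values).
M0, P0, SIG_M, SIG_P, RHO = -20.0, 2.5, 1.0, 0.2, -0.8
KAPPA, M_LIM = 25.0, 14.0
ALPHA = RHO * SIG_P / SIG_M
BETA = P0 - ALPHA * M0
SIG_ETA = SIG_P * np.sqrt(1.0 - RHO**2)     # intrinsic scatter of P about alpha*M + beta

def inverse_errors(w0, n=400_000):
    """Errors w_hat_I - w0 of the inverse estimator for galaxies observable at w0."""
    M = rng.normal(M0, SIG_M, n)
    P = ALPHA * M + BETA + rng.normal(0.0, SIG_ETA, n)
    m = M + 5.0 * w0 + KAPPA
    seen = m < M_LIM
    return 0.2 * (m[seen] - (P[seen] - BETA) / ALPHA - KAPPA) - w0

expected_sd = 0.2 * SIG_ETA / abs(ALPHA)
summary = {}
for w0 in (1.0, 1.8, 2.0):
    err = inverse_errors(w0)
    summary[w0] = (err.mean(), err.std())
    print(f"w0 = {w0:.1f}: mean error {err.mean():+.4f}, sd {err.std():.4f} "
          f"(expected sd {expected_sd:.4f})")
```

The mean error and standard deviation should be the same, within sampling noise, whether the sample at that distance is nearly complete or heavily truncated.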



As we recalled in section 2.1, these are precisely the properties assumed for
the raw log distance estimator, $l_e$, in LB and LS. Our results confirm, therefore,
that $\hat{w}_I$ is the correct raw log distance estimator to use in defining Malmquist
corrections.

It follows from equation (23), on the other hand, that $\hat{w}_D$ will not be normally
distributed for all $w_0$, and in fact will lead to incorrect Malmquist corrections if
these are derived on the assumption of a normal raw estimator. Notwithstanding
this important result, to our knowledge a direct linear regression has been
used exclusively to date in the derivation of both homogeneous and inhomogeneous
Malmquist corrections in the literature (c.f. [14], [?]). We examine the
consequences of this incorrect choice of raw distance estimator in [18] and [19].



2.5 PROPERTIES OF THE UNBIASED 'INVERSE' ESTIMATOR 

It is instructive to summarise the properties of the inverse estimator, $\hat{w}_I$, which
we have thus far confirmed or determined, and add several further results which 
follow easily from them. 

1. In a sample subject to observational selection effects, provided that the 
measurements of line width are selection-free and the conditional expec- 
tation of line width at given absolute magnitude is linear in M, then it 
is possible to define a general linear estimator of log distance which is 
unbiased at all true distances, and the appropriate linear combination corresponds
exactly to the estimator derived from a regression of (log) line
widths upon magnitudes, as prescribed in [?]. This result is valid in the
general case where one accounts for intrinsic and observational scatter in 
both variables, and does not require the assumption that the scatter lies 
only in line widths. 

2. The 'inverse' estimator thus defined is the only unbiased linear estimator
of log distance. Any other linear combination of magnitude and log line
width, and in particular any other regression line, yields an estimator which
is biased at all true distances for a magnitude selection function. Examples
of biased regression lines in this case include, therefore, not only the direct
regression used by e.g. [?] (in its equivalent form for the $D_n - \sigma$ relation),
but also the orthogonal regression (accounting for residuals on both observables
- c.f. [?]); 'bisector' regression (i.e. the line which bisects the direct
and inverse regression lines - c.f. [?]) and mean (i.e. the line whose slope
is the arithmetic mean of the direct and inverse lines - c.f. [?]) regression
lines.

3. The shape of the pdf, and hence in particular the risk (or equivalently 
variance), of the inverse estimator is constant at all true distances. It 
follows from this property that confidence intervals derived from the inverse 
estimator, following the method outlined in HS, are of constant width. For 
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any other general linear estimator, on the other hand, the shape of the pdf 
is severely distorted at large true distances as luminosity selection effects 
become significant. 

The pdf of the inverse estimator is, in fact, identical in shape to the intrin- 
sic pdf of log line width conditional upon absolute magnitude. If the latter 
distribution is normal and of constant variance, as is commonly assumed, 
then so too will be the pdf of the inverse estimator. It is therefore the cor- 
rect choice of 'raw' log distance estimator for the derivation of Malmquist 
corrections. 

5. The unbiased property of the inverse estimator is true for an arbitrary
luminosity function and magnitude selection function, and is independent
of the true number density distribution of galaxies. This is a particularly
useful property, since it follows from equations [1^ and [T^ that the bias
of any other linear estimator will depend explicitly upon the form of the 
luminosity function and magnitude selection effects, so that any attempt 
to correct for or reduce the bias would necessarily be model dependent. 
Indeed, [0] shows that the magnitude of the bias of the direct regression line 
is substantially different for Gaussian and Schechter luminosity functions.

6. One may also define an unbiased log distance estimator for other distance
indicators, including the D_n - σ and magnitude-colour relations, subject
to the same condition that there be one observable free from selection, but
not requiring one observable to be distance-independent. In a
diameter-complete survey, for example, one may construct an unbiased distance
estimator from the observed angular diameter and apparent magnitude. As
above, it is straightforward to show that this unbiased estimator
corresponds exactly to the regression of the selection-free observable upon the
other observable.

7. The inverse estimator is an unbiased estimator of log distance; consequently,
the corresponding distance estimator is biased. It is a simple matter,
however, to define a corresponding unbiased distance estimator, particularly in
the case where w_I is normally distributed (c.f. [14]).
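
The central claim of the list above, that the inverse estimator remains unbiased under magnitude selection while the direct estimator does not, can be illustrated with a small Monte Carlo experiment. The following is a minimal sketch only: it assumes a bivariate normal intrinsic distribution, a sharp apparent magnitude limit, no observational errors, and distances in Mpc. All numerical parameter values and the function name simulate_bias are hypothetical choices made for illustration and are not taken from any calibrating sample.

```python
import math
import random

# Hypothetical intrinsic parameters of a bivariate normal (M, P) distribution.
M0, SIGMA_M = -21.0, 1.0      # mean and dispersion of absolute magnitude
P0, SIGMA_P = 2.5, 0.15       # mean and dispersion of log line width
RHO = -0.9                    # correlation coefficient between M and P
M_LIM = 14.0                  # sharp apparent magnitude limit (assumed)


def simulate_bias(w0, n_draws=200_000, seed=1):
    """Return (bias_inverse, bias_direct) in log distance at true value w0."""
    rng = random.Random(seed)
    mu = 5.0 * w0 + 25.0                      # distance modulus, distance in Mpc
    slope_inv = SIGMA_M / (RHO * SIGMA_P)     # regression of P on M, inverted
    slope_dir = RHO * SIGMA_M / SIGMA_P       # regression of M on P
    sum_inv = sum_dir = 0.0
    n_obs = 0
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        M = M0 + SIGMA_M * z1
        P = P0 + SIGMA_P * (RHO * z1 + math.sqrt(1.0 - RHO**2) * z2)
        m = M + mu
        if m > M_LIM:                         # selection acts on magnitude only
            continue
        n_obs += 1
        sum_inv += 0.2 * (m - (M0 + slope_inv * (P - P0)) - 25.0) - w0
        sum_dir += 0.2 * (m - (M0 + slope_dir * (P - P0)) - 25.0) - w0
    return sum_inv / n_obs, sum_dir / n_obs


# True distance of 100 Mpc, where half of the toy population is lost.
bias_inv, bias_dir = simulate_bias(w0=2.0)
print(f"inverse: {bias_inv:+.4f}   direct: {bias_dir:+.4f}")
```

In this toy model the inverse estimator's mean error should be consistent with zero to within sampling noise, while the direct estimator shows a negative bias of a few hundredths in log distance.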

2.6 UNBIASED ESTIMATORS IN MORE REALISTIC CASES 
Although we have striven to show in this paper that the definition of unbiased
estimators following the prescription of [23] rests on few assumptions and is
otherwise a very general result, one must nevertheless accept that even these 
modest assumptions may not be met in most practical situations. In particular, 
if neither observable is free from selection effects then an unbiased estimator 
formally cannot be defined as a simple linear combination of the observables. 

In the context of both the Tully-Fisher and D_n - σ relations, however, the
problem of P selection is somewhat less important than one might expect. Most 
surveys will be subject to a lower selection limit on line width or velocity dis- 
persion: e.g. it will not be possible to measure accurately velocity dispersions of 
the order of 150 km s^-1 [13]. The interesting - and very useful - property of this
selection limit is, however, that it becomes increasingly less important at larger
distances. This is easy to understand, since at large distances only intrinsically 
brighter (or larger), and thus sufficiently large line width, galaxies will be ob- 
servable. In other words, at large distances the galaxies which are 'lost' to the 
survey due to their small velocity widths would have been unobservable in any 
case, owing to their faint luminosity. 

As an illustrative example, Figure (3) shows the bias of the inverse log
distance estimator derived from the combined Virgo and Ursa Major calibrating
sample of [21], assuming a sharp I-band magnitude limit at I_lim = 14. The
curves show the bias of w_I as a function of true distance (expressed in km s^-1) for
three different line width selection limits. Note that the effect of P selection is
to introduce a positive bias - in contrast to the negative Malmquist bias caused
by an upper limit on observable apparent magnitude. The effect is clearly very
small, however. A bias of 0.01 in w_I corresponds to a systematic distance error of
~2%. Hence, one sees that the effect of a line width limit as large as P_lim = 200
km s^-1 is negligible, and even with a limit of P_lim = 250 km s^-1 the effect can still
be ignored at cosmologically interesting distances in this case.
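
The conversion from a bias in log distance to a fractional distance error quoted above can be written out explicitly. The short worked step below assumes base-10 logarithms; the symbols \Delta w (the bias in log distance), d (the true distance) and \hat{d} (the estimated distance) are introduced here purely for illustration.

```latex
\frac{\hat{d}}{d} = 10^{\Delta w} = 10^{0.01} \approx 1.023
\quad\Longrightarrow\quad
\frac{\hat{d} - d}{d} \approx 2.3\%
```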

In the event that selection effects on P are large enough to be significant - or, 
for example, if the line width selection cannot be well described by a sharp limit, 
independent of distance and morphological type - one can adopt an iterative 
method to reduce Malmquist bias - although such an approach will necessarily 
be model dependent. We discussed this method in HS, for the case of an estimator 
which is a function of apparent magnitude only - so that Schechter's ideas are 
inapplicable. The extension to estimators of Tully-Fisher type is straightforward, 
however. Let w(m, P) denote an estimator of log distance as before. Rearranging 
equation ^ observe that we may write :- 

\[ E[\,w(m,P) \mid w_0\,] = w_0 + B(w, w_0) \tag{24} \]

This is essentially equation (19) of HS, in the equivalent form for an estimator 
of log distance. 

Although we cannot use equation (24) to remove the bias of w(m, P) exactly,
since the true log distance is unknown, suppose we form a new estimator,
w_1(m, P), defined by:-

\[ w_1(m,P) = w(m,P) - B\bigl(w(m,P),\, w_0 = w(m,P)\bigr) \tag{25} \]

In other words, for each m and P we subtract from w(m, P) the bias of the
estimator assuming that the true log distance is equal to its estimated value
(c.f. eq. (20) of HS). One can then compute the bias of the new estimator,
w_1(m, P), apply equation (25) again to define w_2(m, P) in terms of w_1, and so on.
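
The iterative scheme of equation (25) can be sketched numerically. The example below is an illustration under stated assumptions rather than the procedure used in HS: it takes the (biased) direct Tully-Fisher estimator in a toy bivariate normal model with a sharp magnitude limit, estimates the bias curve by Monte Carlo on a grid of trial true log distances, and evaluates the bias at the estimated value by linear interpolation. All parameter values, the grid, and the function names are hypothetical.

```python
import bisect
import math
import random

# Hypothetical toy model: bivariate normal (M, P), sharp magnitude limit.
M0, SIGMA_M, P0, SIGMA_P, RHO, M_LIM = -21.0, 1.0, 2.5, 0.15, -0.9, 14.0
GRID = [1.6 + 0.05 * j for j in range(13)]   # trial true log distances w0


def draw_direct_estimates(w0, n, rng):
    """Raw (biased) direct-regression estimates for n selected galaxies at w0."""
    mu, out = 5.0 * w0 + 25.0, []
    while len(out) < n:
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        M = M0 + SIGMA_M * z1
        if M + mu > M_LIM:                    # lost to the magnitude limit
            continue
        P = P0 + SIGMA_P * (RHO * z1 + math.sqrt(1.0 - RHO**2) * z2)
        M_pred = M0 + RHO * SIGMA_M / SIGMA_P * (P - P0)
        out.append(0.2 * (M + mu - M_pred - 25.0))
    return out


def interp(xs, ys, x):
    """Linear interpolation, clamped to the ends of the grid."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x)
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def iterate_bias_correction(n_iter=3, n=5000, seed=2):
    """Return the bias curve on GRID before and after each of n_iter iterations."""
    rng = random.Random(seed)
    samples = [draw_direct_estimates(w0, n, rng) for w0 in GRID]
    history = []
    for _ in range(n_iter + 1):
        bias = [sum(s) / len(s) - w0 for s, w0 in zip(samples, GRID)]
        history.append(bias)
        # Equation (25): subtract the bias evaluated at w0 = current estimate.
        samples = [[w - interp(GRID, bias, w) for w in s] for s in samples]
    return history


history = iterate_bias_correction()
mid = len(GRID) // 2                          # grid point nearest w0 = 1.9
for k, bias in enumerate(history):
    print(f"iteration {k}: bias at w0 = {GRID[mid]:.2f} is {bias[mid]:+.4f}")
```

Clamping the interpolation at the grid ends is a simplification that distorts the correction near the edges of the grid, so only interior grid points should be trusted. The behaviour of this toy model says nothing about convergence in general.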

It is not obvious that the above iterative scheme will in all cases converge to 
an unbiased estimator. In fact we have shown that this is not the case for 
estimators which are functions of apparent magnitude only. Numerical studies 
indicate that convergence is achieved for the Tully-Fisher case with selection 
on both observables, however, provided that the scatter in the intrinsic joint 
distribution of M and P is not too large. 

Perhaps a more serious problem in defining unbiased distance estimators lies
in the calibration of the distance relation itself. In order to define the inverse 
estimator (or indeed the direct estimator), one must determine the parameters 
of the joint distribution of M and P - e.g. the five parameters M_0, P_0, σ_M, σ_P
and ρ in the bivariate normal case. It is obviously of great importance, therefore,
to ensure that the estimates of these parameters obtained from one's calibrating 
sample accurately reflect their true intrinsic values. It has been suggested (c.f.
[26]) that the scatter measured in distance relations underestimates the true
scatter - leading one to suppose a less serious contribution from Malmquist bias
- simply because the number of calibrating galaxies is insufficient to accurately 
determine the slope and zero point of the relation. 

We have addressed this question in some quantitative detail, carrying out 
numerical experiments on artificial cluster samples of a range of different sizes 
and true parameters, in order to determine how many calibrators are required 
to achieve a given level of accuracy in the fitted Tully-Fisher slope. As an
illustration, Figure (4) shows the results of Monte Carlo simulations carried out
assuming a bivariate normal model for the distribution of M and P and adopting
as true parameter values those given by the Fornax cluster used in the calibration
of the Mathewson galaxy survey (c.f. [0). The bold and dotted lines show the
true inverse and direct regression line slopes respectively, while the two curves
show 1σ confidence limits for the estimated inverse regression line slope as a
function of the number of galaxies in the calibrating sample. 

One can see from Figure (4) that for calibrating samples containing fewer than
~40 galaxies, the dispersion of the estimated slope of the inverse regression line
is greater than the difference between the true slopes of the inverse and direct 
regression lines. Hence one requires a calibrating sample of over 40 galaxies in 
order that the scatter in the slope of the inverse regression line due to sampling 
error be smaller than the difference between the slopes of the two lines. 

Putting this another way, with a considerably smaller sample of calibrators 
there is a strong possibility that the bias in the (supposedly unbiased!) inverse 
estimator due to incorrect determination of the estimator slope will be larger 
than the Malmquist bias of the direct estimator. 
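
A numerical experiment of the kind described above can be sketched as follows. This is a minimal illustration, not a reproduction of Figure (4): it adopts the Fornax M-P correlation coefficient of -0.985 quoted in section 3, but the dispersions SIGMA_M and SIGMA_P are hypothetical placeholders, the means are set to zero (they do not affect the slope), and the function names are introduced here for illustration only.

```python
import math
import random
import statistics

RHO = -0.985                  # Fornax M-P correlation quoted in section 3
SIGMA_M, SIGMA_P = 1.0, 0.15  # hypothetical dispersions (not from the paper)

TRUE_INVERSE = SIGMA_M / (RHO * SIGMA_P)   # dM/dP from regression of P on M
TRUE_DIRECT = RHO * SIGMA_M / SIGMA_P      # dM/dP from regression of M on P


def fitted_inverse_slope(n, rng):
    """Fit P on M by least squares to n simulated calibrators; return dM/dP."""
    Ms, Ps = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        Ms.append(SIGMA_M * z1)
        Ps.append(SIGMA_P * (RHO * z1 + math.sqrt(1.0 - RHO**2) * z2))
    m_bar, p_bar = sum(Ms) / n, sum(Ps) / n
    s_mp = sum((m - m_bar) * (p - p_bar) for m, p in zip(Ms, Ps))
    s_mm = sum((m - m_bar) ** 2 for m in Ms)
    return s_mm / s_mp        # reciprocal of the fitted P-on-M slope


def slope_dispersion(n, trials=2000, seed=3):
    """Standard deviation of the fitted inverse slope over many mock samples."""
    rng = random.Random(seed)
    return statistics.stdev([fitted_inverse_slope(n, rng) for _ in range(trials)])


gap = abs(TRUE_INVERSE - TRUE_DIRECT)
dispersions = {n: slope_dispersion(n) for n in (10, 20, 40, 80)}
print(f"|inverse - direct| true slope difference: {gap:.3f}")
for n, disp in dispersions.items():
    print(f"n = {n:3d}: dispersion of fitted inverse slope = {disp:.3f}")
```

Because both the slope difference and the sampling dispersion scale with SIGMA_M / SIGMA_P, the sample size at which they cross depends, in this toy model, only on the correlation coefficient.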

Clearly, then, it is important to use as large a calibrating sample as possible 
to minimise this problem. One solution is to combine data from several different 
clusters, as in [21] and |T^, combining two samples from the Virgo and Ursa
Major clusters, whose distance moduli have been found to be equal. 0) discuss
two different methods of tackling the problem of combining calibration data from 
clusters at different distances, and obtaining optimal estimates of the slope and 
zero point of the distance relation simultaneously with relative distances to each 
cluster. 

Of course another way in which the problems of sampling error can be reduced 
is by identifying distance relations of intrinsically smaller scatter. In section 3 we
consider how one might achieve this by defining estimators which are functions 
of more than two observables. 

3 ESTIMATORS OF DISTANCE USING THREE OR MORE OBSERVABLES

In this section we briefly discuss the properties of distance estimators which are 
defined as a function of apparent magnitude and two other observable quantities, 
such as one might define in extending the Tully-Fisher relation to include the 
observed angular diameter of spiral galaxies. Of particular interest is the ques- 
tion of whether one may still define unbiased estimators in this case, analogous 
to the P on M estimator of the previous section, and if so whether one may 
construct unbiased estimators which have a smaller risk than their two-variable 
counterparts. 

One can carry out an analysis which follows closely the formulation adopted 
in section 2: i.e. first one derives the joint distribution at given true log distance,
w_0, after accounting for observational selection effects, of the random variables
- M, P and D say, denoting for example absolute magnitude, log line width
and log of absolute diameter - in terms of their intrinsic joint distribution and
selection function to obtain an expression analogous to equation 10, viz:



\[ p(M,P,D \mid w_0) = \frac{\Psi(M,P,D)\,S(M,P,D,w_0)}{\iiint \Psi(M,P,D)\,S(M,P,D,w_0)\,dM\,dP\,dD} \tag{26} \]

One can then determine, for a general linear combination of the observables, the 
distribution, bias and risk of this 'general linear' estimator and, as before, identify 
for which values the estimator is unbiased. The details of these calculations are 
somewhat tedious and add little to the previous analysis for two variables. We 
present, therefore, a summary of the main results for the 3- variable case. 

We considered two cases: firstly where only one of the three observables is free 
from observational selection, and secondly where two observables are selection- 
free. In both cases it was possible to define an unbiased estimator of log distance 
by appropriate linear combination of the observables. The values of the
coefficients corresponding to the unbiased solution were given in terms of the
parameters of the intrinsic distribution function, Ψ(M, P, D), as in the two variable case.
To take a specific example, if M, P and D were jointly normally distributed, then
the coefficients depend solely upon the mean values, dispersions and correlation 
coefficients of the trivariate normal distribution. 

In the first case where only one observable is selection-free, we found that an 
unbiased estimator can, in general, be defined only as a linear combination of all 
three observables. This has important consequences for our earlier results. In the 
case of the Tully-Fisher relation, for example, if one's sample is subject to both 
diameter and magnitude selection then the inverse estimator defined in section 2
using only apparent magnitude and log line width will no longer be unbiased.
This is because the selection on diameters affects the joint distribution of m 
and P, since the galaxy diameter is correlated with these variables. A similar 
effect is discussed in |1^, where selection on diameter and surface brightness 
'pollutes' the distribution of m and P and affects the bias of the Tully-Fisher 
relation. Clearly, therefore, great care must be taken in ensuring no additional 
observables introduce selection 'by proxy' into one's samples. The fact that an 
observable does not appear in the definition of one's distance estimator does not
imply that it can have no effect on the bias of that estimator. 

In the second case, where two observables are free from selection, a rather 
different picture emerges. Taking again the example of magnitude, line width 
and diameter to fix ideas, we found that in this case the inverse estimator defined
in section 2 is still unbiased at all true distances, so that Schechter's prescription
is still valid. The inverse estimator is, however, no longer the only unbiased
estimator of log distance - although it is still the only unbiased estimator formed 
from a linear combination of magnitude and line width alone. By forming an 
estimator from three observables, we have sufficient freedom to define an unbiased 
estimator of minimum variance, and one may show that the variance of this 
optimal 3-variable estimator is always less than or equal to that of the inverse 
estimator defined by magnitude and line width alone. 

The precise factor, A, by which the addition of a third observable, D, reduces
the variance of the inverse estimator depends only upon the values of the
correlation coefficients between the three observables (c.f. [0). As an illustration,
consider the specific case where M, P and D are jointly normally distributed,
with correlation coefficients denoted by ρ_mp, ρ_md and ρ_pd. In this case A is given
by the following expression:-

\[ A = \frac{\rho_{mp}^2 \left[ 1 - (\rho_{mp}^2 + \rho_{md}^2 + \rho_{pd}^2) + 2\rho_{mp}\rho_{md}\rho_{pd} \right]}{(1 - \rho_{mp}^2)\left[ \rho_{mp}^2 - 2\rho_{mp}\rho_{md}\rho_{pd} + \rho_{md}^2 \right]} \tag{27} \]

Figures (5), (6) and (7) show respectively scatter diagrams for the I-band
magnitude versus log line width, magnitude versus log diameter and log diameter
versus log line width relations for the Fornax cluster, determined from the
Mathewson galaxy redshift survey. It is clear from these figures that a very good
correlation exists between all three observables, and the correlation coefficients
for this calibrating sample were found to be ρ_mp = -0.985, ρ_md = -0.963 and
ρ_pd = 0.942. Notwithstanding the fact that the Fornax cluster is a rather small
calibrating sample, in the light of our remarks in section 2.6, if we assume these
correlation coefficients to be equal to the intrinsic values for the magnitude -
diameter - line width relation then substituting in equation (27) gives a value of
A = 0.64. In other words the variance of the 3-variable estimator is more than
35% smaller than that of the corresponding P on M estimator. This corresponds
to a reduction in the mean distance error dispersion from ~20% to around 15%.
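
The variance ratio A can be evaluated directly from the three correlation coefficients. The sketch below codes the expression for A given above and, as a cross-check, recomputes the ratio from the minimum-variance unbiased linear combination of P and D conditional on M, in standardised units. The cross-check is our own reading of how such a ratio arises under the trivariate normal assumption and is not taken from the text; the function names are hypothetical. Evaluated with the three-figure coefficients quoted above, the expression gives roughly 0.62, slightly below the quoted 0.64. The result is very sensitive to rounding, since the correlation matrix is nearly singular, and the quoted value may have been computed from unrounded coefficients.

```python
def variance_ratio(r_mp, r_md, r_pd):
    """Ratio A of the 3-variable to the 2-variable estimator variance."""
    triple = 2.0 * r_mp * r_md * r_pd
    det = 1.0 - (r_mp**2 + r_md**2 + r_pd**2) + triple
    return r_mp**2 * det / ((1.0 - r_mp**2) * (r_mp**2 - triple + r_md**2))


def variance_ratio_blue(r_mp, r_md, r_pd):
    """Cross-check via the best linear unbiased combination of P and D.

    Works in standardised units, conditioning (P, D) on M (an assumption
    made for this illustration).
    """
    a_p, a_d = r_mp, r_md                       # slopes of E[P|M] and E[D|M]
    s_pp, s_dd = 1.0 - r_mp**2, 1.0 - r_md**2   # conditional variances
    s_pd = r_pd - r_mp * r_md                   # conditional covariance
    det = s_pp * s_dd - s_pd**2
    # a^T Sigma^{-1} a for the 2x2 conditional covariance matrix Sigma
    quad = (a_p**2 * s_dd - 2.0 * a_p * a_d * s_pd + a_d**2 * s_pp) / det
    var_three = 1.0 / quad                      # minimum-variance combination
    var_two = s_pp / a_p**2                     # inverse (P on M) estimator
    return var_three / var_two


# Fornax correlation coefficients as quoted in the text (rounded values).
A = variance_ratio(-0.985, -0.963, 0.942)
A_check = variance_ratio_blue(-0.985, -0.963, 0.942)
print(f"A = {A:.3f}, cross-check = {A_check:.3f}, dispersion ratio = {A ** 0.5:.3f}")
```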

It would seem clear, therefore, that utilising the measurements of a third 
observable can offer a means of significantly reducing the dispersion of unbiased
distance estimators, and thus obtaining more reliable distance estimates. When
such an observable is available - as is the case in the above example of the
magnitude - diameter - line width relation - its use would seem to be strongly
advised.

4 CONCLUSIONS 

In this paper we have studied the properties of galaxy distance estimators
derived from combining measurements of two or more observables, as is the case
for the Tully-Fisher and D_n - σ relations. We have considered the effects of
observational selection upon the distribution, bias and risk of these estimators and
have established that, subject to modest but crucial assumptions, it is possible to
define estimators which are unbiased at all true distances, in confirmation of the
results of [23]. We have shown that these results are more general than is often
assumed in the literature: in particular, that one can define unbiased distance 
estimators independently of the form of the magnitude selection function and the 
local number density of galaxies, and almost independently of the intrinsic joint 
distribution of magnitude and line width. Moreover, the results are derived in the 
general case of observational and intrinsic scatter on both correlated variables. 

We have compared our treatment of Malmquist bias with other approaches 
which have been adopted in the literature, and shown how the differences be- 
tween them can be understood as fundamentally different interpretations of the 
nature of probability. Moreover, we have shown that when the distribution of 
log line widths conditional on magnitudes is normal, then so too is the pdf of 
the unbiased inverse estimator. It is therefore the only appropriate choice of raw 
log distance estimator which is consistent with the assumptions made in deriving 
homogeneous and inhomogeneous Malmquist corrections in the literature. 

Finally, we have also considered how one can define unbiased estimators of 
smaller variance by utilising additional, suitably correlated, observables. In fu- 
ture work we will apply these multivariate estimators to the analysis of real galaxy 
surveys, in order to extend and improve the optimal techniques for smoothing 
and recovery of the peculiar velocity field described in [113] and [El . 

Figure 1. Schematic Tully-Fisher relations, derived by applying a
direct and inverse linear regression to a complete calibrating sample
- e.g. a nearby cluster.

Figure 2. The expected value of absolute magnitude conditional upon 
log line width, and log line width conditional upon magnitude, 
in a distribution subject to a sharp selection limit on absolute 
magnitude — e.g. a distant cluster. The shaded region represents 
unobservable galaxies. 

Figure 3. Bias of the inverse estimator with line width selection
effects, assuming a bivariate normal distribution for M and P with
distribution parameters taken from the Virgo and Ursa Major
composite calibrating sample of [21].

Figure 4. 1σ confidence limits for the sample estimate of the slope of
the inverse regression line as a function of the number of galaxies 
in the calibrating sample. Distribution parameters are taken from 
the Fornax cluster — as determined in the Mathewson galaxy 
survey. 

Figure 5. Scatter plot of the Tully-Fisher, I-band magnitude versus 
log line width, relation for the Fornax cluster, derived from the 
Mathewson redshift survey. 

Figure 6. Scatter plot of the I-band magnitude versus log diame- 
ter relation for the Fornax cluster, derived from the Mathewson 
redshift survey. 

Figure 7. Scatter plot of the log diameter versus log line width rela- 
tion for the Fornax cluster, derived from the Mathewson redshift 
survey. 